Abstract

The cross-depiction problem is that of recognising visual objects regardless of whether they are photographed, painted, drawn, etc. It is a potentially significant yet under-researched problem. Emulating the remarkable human ability to recognise objects in an astonishingly wide variety of depictive forms is likely to advance both the foundations and the applications of Computer Vision.

In this paper we benchmark classification, domain adaptation, and deep learning methods, demonstrating that none performs consistently well in the cross-depiction problem. Given the current interest in deep learning, it is notable that such methods exhibit the same behaviour as all but one other method: they show a significant fall in performance over inhomogeneous databases compared to their peak performance, which is always over data comprising photographs only. Rather, we find that methods with strong models of spatial relations between parts tend to be more robust, and therefore conclude that such information is important in modelling object classes regardless of appearance details.

1 Introduction 

Humans are able to recognise objects in an astonishing variety of forms. Whether photographed, drawn, painted, or carved in wood, people can recognise horses, elephants, people, etc. The same is not true of computers: even the very best recognition algorithms, including deep learning, exhibit a significant drop in performance when presented with an inhomogeneous data set, and fall further still when trying to recognise a drawn object after being trained only on photographic examples.

Cross-depiction forces one to consider which visual attributes are necessary for recognition, and which are merely sufficient. To illustrate this: humans can recognise trains in full colour photographs, as vague fuzzy blobs in paintings such as Rain, Steam, and Speed by J.M.W. Turner, in sketchy line drawings, and as a simplified silhouette in UK road signs. Ostensibly at least, these vastly different depictions of a train have nothing in common except (of course) that each of them shows a recognisable train.

It is clear that specific appearance is able to vary significantly - to a much greater degree than due to lighting changes, for example - and still people can recognise objects. Children’s drawings, as in Figure 1, are both highly abstract and highly variable, yet contain sufficient information for objects to be recognised by humans, but not computers.

Equally clearly, learning the specifics of each depiction is at best unappealing, not least because the gamut of possible depictions is potentially unlimited. Rather, the question is: what abstraction do these classes have in common that allows them to be recognised regardless of depiction? It is this and similar questions that push at the foundations of Computer Vision.

A machine that is able to recognise regardless of depiction would provide a significant boost to current applications, such as image search and rendering. For example, given a photograph of the Queen of England, a search should return all portraits of her, including postage stamps that capture her likeness in bas-relief. Searching heterogeneous data sets is a real problem for the creative industries, because they archive vast quantities of material in a huge variety of depictions - a problem that requires visual class models that span depictive styles. Non-photorealistic rendering from images and video would be boosted too, not least because highly aesthetic renderings depend critically on the level of abstraction available to algorithms. Picture making is nothing like tracing over photographs: humans draw what they know of an object, not what they see - computers should do likewise.

This paper: (1) establishes that there is a literature gap; (2) provides two databases designed for the cross-depiction problem; (3) provides experimental evidence that no current method copes with the cross-depiction problem for either classification or detection; (4) provides an empirically based explanation of the experimental results; (5) suggests possible ways ahead based on all the data in this paper.

Figure 1: Children’s drawings.

As a note: in this paper, we use the term photograph as a shorthand for “natural image”, and the term artwork to refer to all other images.

2 Related Literature 

There is a vast literature in Computer Vision that addresses the problem of recognition, by which we mean both classification (does this image contain an object of class X, or not?) and detection (an object of class X is at this place in this image). Yet almost no prior art addresses the cross-depiction problem, which is surprising given its genuine potential for advancing Computer Vision both in its foundations and in its applications.

Of the many approaches to visual object classification, the bag-of-words (BoW) family [?] is amongst the most widespread. It models visual object classes as histograms of visual words, these words being clusters in feature space. Although the BoW methods address many difficult issues, they tend to generalise poorly across depictive styles (see Subsection 5.1). Alternative low-level features such as edgelets [?] may be considered, or mid-level features such as region shapes [?]. These features offer a little more robustness, but only if the silhouette shape is constrained - and only if the picture offers discernible edges, which is not the case for many artistic pictures (Turner’s paintings, for example).

Deformable models of various types are widely used to model object classes for detection tasks, including several kinds of deformable template models [?] and a variety of part-based models [?]. In the constellation models from [23], parts are constrained to be in a sparse set of locations, and their geometric arrangement is captured by a Gaussian distribution. In contrast, pictorial structure models [?] define a matching problem where parts have an individual match cost in a dense set of locations, and their geometric arrangement is captured by a set of springs connecting pairs of parts. Among these methods, the Deformable Part-based Model (DPM) [21] is widely used. It describes an object detection system based on mixtures of multi-scale deformable part models plus a root model. By modelling objects from different views with distinct models, it is able to detect large variations in pose. None of these directly addresses the cross-depiction problem.

Shape has also been considered. Leordeanu et al. [43] encode relations between all pairs of edgels of a shape to go beyond individual edgels. Similarly, Elidan et al. [?] use pairwise spatial relations between landmark points. Ferrari et al. [25] propose a family of scale-invariant local shape features formed by short chains of connected contour segments. Shape skeletons are the dual of shape boundaries, and have also been used as descriptors. For example, Rom and Medioni [48] suggest a hierarchical approach to shape description, combining local and global information to obtain the skeleton of a shape. Sundar et al. [55] use a skeletal graph to represent shape and use graph matching techniques to match and compare skeletons. Shock graphs [53] are derived from skeleton models of shapes, and focus on the properties of the surrounding shape. Shock graphs are obtained as a combination of singularities that arise during the evolution of a grassfire transform on any given shape. In particular, the set of singularities consists of corners, lines, bridges and other similar features. Shock graphs are then organised into shock trees to provide a rich description of the shape.

Algorithms usually assume that the training and test data are drawn from the same distribution. This assumption may be breached in real-world applications, leading to domain-adaptation methods such as transfer component analysis (TCA) [?], which transfers components from one domain to another. Both sampling geodesic flow (SGF) [?] and geodesic flow kernel (GFK) [31] use intermediate subspaces on the geodesic flow connecting the source and target domains. GFK represents state-of-the-art performance on the standard cross-domain dataset [24]; it has been used to classify photographs acquired under different environmental conditions, at different times, or from different viewpoints.

Cross-depiction problems are comparatively less well explored. Some work is very specific: Crowley and Zisserman take a weakly supervised approach, using a DPM to learn figurative art on Greek vases [13]. Others address the problem of searching a database of photographs based on a sketch query; edge-based HoG was explored in [?] and Li et al. [41]. Others have investigated sketch-based retrieval of video [?].

Approaches to the more general cross-depiction problem are rare. Matching visually similar images has been addressed using self-similarity descriptors [?]. The method relies on a spatial map built from correlations of small patches; it therefore encodes a spatial distribution, but tends to be limited to small rigid objects. Crowley and Zisserman [?] provide the only example of domain adaptation we know of specifically designed for the cross-depiction problem; they train on photographs and then use mid-level patches to learn spatial consistencies (scale and translation) that allow matching from photographs into artwork. Their method performs well in retrieval tasks for 11 object classes in databases of paintings.

Figure 2: Top: Photo-Art-50 dataset, containing 50 object categories. Each category is displayed with one art image and one photo image. Bottom: People-Art dataset, designed for detection.

Classification, rather than matching, has also been studied. Shrivastava et al. [52] show that an Exemplar SVM trained on a huge database is capable of classifying both photographs and artwork. A less computationally intensive approach has been proposed [?], using a hierarchical graph model to obtain a coarse-to-fine arrangement of parts with nodes labelled by qualitative shape [60]. Wu et al. address the cross-depiction problem using a deformable model [32]; they use a fully connected graph with learned weights on nodes (the importance of a node to discriminative classification), on edges (by analogy, the stiffness of a spring connecting parts), and multiple node labels (to account for different depictions); the method was tested on 50 categories. Others use no labels at all, but rely on connection structure alone [?] or distances between low-level parts [?].

Deep learning has recently emerged as a truly significant development in Computer Vision. It has been successful on conventional databases, and over a wide range of tasks, with recognition rates in excess of 90%. Deep learning has been used for the cross-depiction problem, but its success is less clear cut. Crowley and Zisserman [?] are able to retrieve paintings in 10 classes at a success rate that does not rise above 55%; their classes do not include people. Ginosar et al. [25] use deep learning for detecting people in Picasso paintings, achieving rates of about 10%.


Other than this paper, we know of only two studies assessing the performance of well established methods on the cross-depiction problem. Crowley and Zisserman [12] use a subset of the ‘Your Paintings’ dataset [3], the subset comprising those paintings that have been tagged with VOC categories [?]. Using 11 classes, and objects that can only scale and translate, they report an overall drop in per-class Prec@k (at k = 5) from 0.98 when trained and tested on paintings alone, to 0.66 when trained on photographs and tested on paintings. Hu and Collomosse [35] use 33 shape categories in Flickr to compare a range of descriptors: SIFT, multi-resolution HOG, Self Similarity, Shape Context, Structure Tensor, and (their contribution) Gradient Field HOG. They test a collection of 8 distance measures, reporting low mean average precision rates in all cases.
3 Data Sets

To date there is no accepted publicly available database that has been specifically designed for the cross-depiction problem. In this paper we use two annotated image datasets, both designed for the evaluation of cross-depiction algorithms in classification and detection. Samples can be seen in Figure 2; each dataset is explained next.

3.1 Multi-class Set: Photo-Art-50. 

This dataset is designed for classification problems. It contains 50 object classes, with between 90 and 138 images for each class. Each class is approximately half photographs and half artwork. All 50 classes appear in Caltech-256; a few also appear in the PASCAL VOC Challenge [19] and the ETH-Shape dataset [26].

Some of the photographs come from Caltech-256, the rest from Google search. Artworks were searched for using a few keywords to cover a wide gamut of depiction styles, e.g., ‘horse cartoon’, ‘horse drawing’, ‘horse painting’, ‘horse sketches’, ‘horse child drawing’, etc. We manually selected images in which a meaningful object area has a reasonable size. We further manually provide the ground-truth bounding boxes.

3.2 Detection Set: People-Art 

However, one problem with building a dataset for cross-depiction is that object classes do not appear with equal abundance in artwork. People tend to draw some object classes far more frequently than others - people draw people a great deal, but artwork showing headphones and beer-mugs (both classes in Photo-Art-50, both in Caltech-256) is harder to come by, and (anecdotally/by observation) appears in relatively few depictive styles. Therefore our second database contains only people; it is better suited to detection problems than to classification problems.

It consists only of people, in 43 different styles, among which 41 styles are downloaded from the wikipaintings.org website, one cartoon style from Google search and one photographic style from PASCAL VOC2012. The dataset is divided into a training set, a validation set and a test set. The training set has 1627 images, among which 1324 ‘person’ objects are annotated in 521 images. The validation set has 1387 images, among which 1080 person objects are annotated in 442 images. The test set has 1617 images, among which 1083 person objects are annotated in 520 images.

This dataset represents a much wider gamut of depictive styles than Photo-Art-50. Additionally, the people in the artwork appear in a far greater variety of poses than is common in photographs.

4 Benchmarked Algorithms 

We benchmark several algorithms for classification and several for detection. Our purpose is not to act as advocates for any method, but to characterise current understanding with regard to the cross-depiction problem. The algorithms we report are not exhaustive: the area is far too well researched for that (at least using photographic databases). Rather, we have selected methods on the grounds of historical importance, current popularity, and state of the art performance - and have included some inventions of our own. In addition to the methods reported, we also tested other alternatives [?], but none worked sufficiently well to report here.

4.1 Bag of Words (BoW) 

There are many variants of BoW methods, see Section 2. We use Csurka et al.’s version [?] because it is well known, widely used, and classifies photographs well. Given a set of labelled training images, local descriptors are computed on a regular grid with multiple-sized regions. A vocabulary of words is constructed by vector quantisation of local descriptors with k-means clustering (k = 1000). To construct a visual class model (VCM) each image is partitioned into L levels of increasingly fine cells (L = 2 in our experiments). A histogram of word occurrences is built for each cell; concatenating these histograms encodes the image with a 5000-dimensional vector. A one-versus-all linear SVM classifier is trained on a $\chi^2$-homogeneous kernel map [55] of all training histograms. Given a test image, the local features are extracted in the same way as in the training stage and mapped onto the codebook to build a multi-resolution histogram, which is then classified with the trained SVM.
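
The spatial-pyramid encoding described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in the experiments: the function names are hypothetical, and it assumes that level l of the pyramid splits the image into 2^l x 2^l cells, which gives 1 + 4 = 5 cells for L = 2 and hence 5 x 1000 = 5000 dimensions.

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Return the index of the nearest codebook word for each descriptor."""
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ codebook.T
        + (codebook ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)

def pyramid_histogram(words, xy, image_size, n_words, levels=2):
    """Concatenate per-cell word histograms over all pyramid levels.

    Assumption: level l uses a 2**l x 2**l grid of cells.
    """
    width, height = image_size
    parts = []
    for level in range(levels):
        n = 2 ** level
        # Cell index of every feature along x and y at this level.
        cx = np.minimum((xy[:, 0] * n / width).astype(int), n - 1)
        cy = np.minimum((xy[:, 1] * n / height).astype(int), n - 1)
        for i in range(n):
            for j in range(n):
                in_cell = (cx == i) & (cy == j)
                parts.append(np.bincount(words[in_cell], minlength=n_words))
    return np.concatenate(parts).astype(float)
```

With a 1000-word codebook and `levels=2`, the returned vector has 5000 entries, matching the dimensionality quoted in the text; the kernel map and SVM would then be applied to these vectors.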

Choice of feature may be important to the cross-depiction problem (see Section ??). Therefore we test a collection of distinct features, as follows:

SIFT [?] is a 128-dimensional vector created by stacking 8-bin orientation histograms on 4 x 4 cells. We use the implementation of dense-SIFT in [?] and sample SIFT with four region sizes on a regular grid with a step of 3 pixels.

Geometric Blur (GB) [?] describes local regions by geometrically blurring oriented edge maps. It is able to match object parts with very different appearance in two images. We follow the original setup in [?].

Self-similarity descriptors (SSD) [51] [5] measure local self-similarity patterns by correlating a tiny local patch (typically 5 x 5) within a larger local region. They compute local correlations of patches rather than pixel values, and perform well at matching similar objects regardless of depictive style. We include them in the BoW framework to observe their behaviour in cross-depiction classification. We follow the default parameter settings from [5] except that we use 4 region sizes to capture a wider variation of local patterns.

Histogram of Oriented Gradients (HOG) [15] is a vector of normalised histograms from tiled block regions. It is the most effective feature in the context of object detection [?] and also the most favoured local feature in the context of sketch-based retrieval [?]. We compute HOG using the VLFeat [?] implementation. The gradients are quantised into 9 orientations and four cell sizes are used.

We include edgeHOG for comparison due to its effectiveness in sketch-based retrieval [35]. Unlike standard HOG, which extracts the descriptor on the original image map, edgeHOG computes the gradient orientation histograms over edge maps.

Fisher Vectors (FV) [?], strictly speaking, are not BoW. Instead of counting word occurrences (as in BoW), a GMM with K = 256 components is fitted to the distribution of a set of local feature vectors (we use SIFT) extracted from the training images. The FV of an image stacks the mean and covariance deviations into a single vector. We follow [?], applying Hellinger’s kernel to each dimension of the Fisher vector followed by $\ell_2$-normalisation. Like BoW, an identical spatial pyramid is used. Then, a one-versus-all linear SVM classifier is trained on the Fisher vectors obtained from all training images.
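
The two normalisation steps applied to the Fisher vector can be illustrated with a short sketch. It assumes that Hellinger’s kernel is realised in the usual explicit form, as a signed square root of each dimension; the function name and the `eps` guard are hypothetical, and the GMM fitting and gradient computation that produce the raw vector are not shown.

```python
import numpy as np

def normalise_fisher_vector(fv, eps=1e-12):
    """Signed square root (Hellinger map) followed by L2 normalisation."""
    fv = np.asarray(fv, dtype=float)
    # Hellinger map: keep the sign, take the square root of the magnitude.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    # L2 normalisation; eps avoids division by zero for an all-zero vector.
    return fv / max(np.linalg.norm(fv), eps)
```

After this step a linear SVM on the normalised vectors behaves like a Hellinger-kernel SVM on the raw ones, which is presumably why a linear classifier suffices here.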

4.2 Domain Adaptation 

Photographs and art can be seen as belonging to different domains. Excellent domain adaptive methods, including but not limited to [33], [?], [?], [50], [?], show clear benefits for photographs captured under different conditions. Two state of the art methods are evaluated on our dataset:

Geodesic Flow Kernel (GFK) models the source domain S and target domain T with lower dimensional linear subspaces and embeds them onto a Grassmann manifold. Geodesic flow is parameterized as a curve between these two subspaces, see Gong et al. [31]. We used two variants of GFK kernels: GFK_PCA and GFK_LDA. In GFK_PCA the original features are projected onto a 49-dimensional subspace, with PCA on each domain. GFK_LDA replaces PCA with a supervised dimension reduction method, linear discriminant analysis (LDA), on the source domain.

Subspace Alignment (SA) [33] projects S and T to their respective subspaces. Then, a linear transformation function is learned to align the two subspaces.
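
As an illustration of the subspace alignment idea, the sketch below uses PCA bases for both domains and the closed-form alignment matrix M = Ps^T Pt, which minimises the Frobenius norm of Ps M - Pt. The function names and the choice of centring each domain by its own mean are assumptions made for this example; it is a sketch of the general technique, not the code used in the benchmark.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X as the columns of a (D, d) matrix."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:d].T

def subspace_alignment(source, target, d):
    """Project both domains so that a classifier trained on the aligned
    source features can be applied to the projected target features."""
    Ps = pca_basis(source, d)
    Pt = pca_basis(target, d)
    M = Ps.T @ Pt  # aligns the source basis with the target basis
    source_aligned = (source - source.mean(axis=0)) @ Ps @ M
    target_projected = (target - target.mean(axis=0)) @ Pt
    return source_aligned, target_projected
```

When the two domains coincide, M reduces to the identity and the two outputs are equal, which is a convenient sanity check.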

4.3 Part Based Models 

Part-based models use larger scale ‘features’, and take the spatial relationships between these parts into account.

Deformable Parts Model (DPM) [21] is a state of the art representation. It models an object with a star graph, i.e., a root filter plus a set of parts. Given the location of the root and the relative locations of n parts (n = 8 in our experiments), the score of the star model is the sum of the responses of the root filter and part filters, minus the displacement cost. Each node in a DPM is labelled with a HoG feature, learned from examples.
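
The star-model score described above can be made concrete with a brute-force sketch. It is deliberately simplified: the displacement cost is assumed to be purely quadratic in the offset from each part’s anchor, the filter responses are taken as given arrays, and all names are hypothetical. The actual DPM [21] uses a richer deformation model and an efficient distance transform rather than exhaustive search.

```python
import numpy as np

def star_score(root_response, part_responses, anchors, deform_weights):
    """Root response plus, for each part, the best response minus cost.

    part_responses: list of 2-D arrays indexed [row, col]
    anchors:        list of (ax, ay) ideal part positions (col, row)
    deform_weights: list of (wx, wy) quadratic penalty weights
    """
    total = float(root_response)
    for resp, (ax, ay), (wx, wy) in zip(part_responses, anchors, deform_weights):
        ys, xs = np.mgrid[0:resp.shape[0], 0:resp.shape[1]]
        # Displacement cost grows with squared distance from the anchor.
        cost = wx * (xs - ax) ** 2 + wy * (ys - ay) ** 2
        total += float((resp - cost).max())
    return total
```

A part may therefore move away from its anchor only when the gain in filter response outweighs the displacement cost.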

By analogy with domain adaptation, we considered the possibility of query expansion for DPM to obtain an Adapted DPM (ADPM). We first train a standard DPM model for each object category in the training set (i.e., source domain) S. We then apply the models on the test set (i.e., target domain) T. A confidence set $C \subset T$ is constructed from the test set for training expansion by picking images that match a particular VCM especially well:

$$C = \{\, x \in T \mid s_1(x) > \theta_1 \,\wedge\, s_1(x) - s_2(x) > \theta_2 \,\} \qquad (1)$$

with $s_1(x)$ the highest DPM score for image $x$, $s_2(x)$ the next highest score, and $\theta_1$, $\theta_2$ user-specified parameters that threshold the best score and the margin respectively. We found $\theta_1 = -0.8$ and $\theta_2 = 0.1$ to be a good trade-off between minimising false positives (5%) and including an appropriate number of expanded data (around 580 images in C).
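
The selection rule in Equation (1) is simple to express in code. The sketch below assumes the per-class DPM scores for the target images are already available as a matrix with one row per image and one column per class; the function name and the returned pseudo-labels are illustrative assumptions rather than part of the original method description.

```python
import numpy as np

def confidence_set(scores, theta1=-0.8, theta2=0.1):
    """Select target images whose best score and margin pass both thresholds.

    scores: array of shape (n_images, n_classes), with n_classes >= 2.
    Returns the indices of the selected images and their top-scoring class.
    """
    scores = np.asarray(scores, dtype=float)
    ranked = np.sort(scores, axis=1)
    s1 = ranked[:, -1]  # highest score per image
    s2 = ranked[:, -2]  # next highest score per image
    keep = (s1 > theta1) & (s1 - s2 > theta2)
    indices = np.flatnonzero(keep)
    labels = scores[indices].argmax(axis=1)
    return indices, labels
```

The selected images, labelled with their top-scoring class, would then be added to the training data before the DPMs are retrained.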

The fully connected multi-labelled graph (MG) model of Wu et al. [59] is designed for the cross-depiction problem. It attempts to separate appearance features (contingent on the details of a particular depiction) from the information that characterises an object class without reference to any depiction. Unlike DPM, it comprises a fully connected weighted graph, and has multiple labels per node. Each graph has eight nodes. Weights on nodes can be interpreted as denoting the importance of a node to object class characterisation in a way that is independent of depiction. Weights on arcs are high if the distance between the connected pairs of parts varies little. These weights are learned using a structural support vector machine [?]. In addition to the weights, each node carries two feature labels. These are designed to characterise the appearance of parts in both photographs and artwork (see the Discussion for a justification).


4.4 Deep Learning 

Convolutional neural networks (CNN) [?] have yielded a significant performance boost on image classification. To adapt the CNN to the object detection task, Girshick et al. [29] proposed R-CNN (Regions with CNN features), which combines region proposals with CNNs. As annotated data is scarce, it is insufficient to train a large CNN from scratch.

The solution we use is standard practice. The CNN parameters are first initialised by supervised pre-training on the large ILSVRC2013 dataset, then fine-tuned on the annotated regions. To be precise, the R-CNN method works in three steps. The first generates around 2000 category-independent region proposals by selective search [?]. The second extracts a fixed-length feature vector for each warped region by forwarding it through a pre-trained AlexNet [40]. More specifically, we use the output of the last fully-connected layer before the classification layer (fc-7) as the region features. The third step scores each region with class-specific linear SVMs.

For classification, we follow Crowley and Zisserman [?], encoding images in each class with CNN features, which are then used as input to learn a one-vs-all linear SVM classifier. For detection, we run the experiments with the R-CNN code [?] downloaded from the authors’ website. The CNN architectures and the fine-tuning are implemented using the publicly available Caffe [?].
model         BoW                            FV     DPM    MG      CNN
train  test   SIFT  GB    SSD   HOG   eHOG   SIFT   HOG    2xHOG   learned
P      P      84    77    66    72    70     87     88     85      97
M      P      80    72    58    65    63     84     85     90      96
A      P      64    60    39    42    50     66     78     83      91
A      A      74    72    49    55    60     77     83     89      89
M      A      69    67    45    50    56     73     80     89      87
P      A      44    50    31    29    40     47     68     83      73

Table 1: Classification benchmarks on Photo-Art-50. Each row is a train (30 images) / test (rest) pattern: Art (A), Photo (P), Mixed (M). Each column is an algorithm with its feature, divided into groups: BoW [45, ?, 52, ?, 35, 47], parts-based [?] [59], deep learning with CNN [?]. Each cell shows the mean of 5 randomized trials. The standard deviation on any column never rises above 2%.


train  test   SVM   PCA_S  PCA_T  GFK_PCA  GFK_LDA  SA    ADPM
Art    Photo  54    46     48     48       50       45    84
Photo  Art    36    30     31     31       32       29    78

Table 2: Domain adaptation. These methods are designed to ‘jump’ from one domain to the other, therefore we restricted training to one domain and testing to the other. Each row is a train/test pattern. Each column is an algorithm: SVM is a linear SVM using SIFT features; PCA_S is SVM with PCA on the source domain only; PCA_T is SVM with PCA on the target domain only; GFK_PCA and GFK_LDA are GFK [31] with PCA and LDA on the features; SA is subspace alignment [?]; ADPM is the adapted DPM of Section 4.3.


5 Classification Benchmarks 

We use Photo-Art-50 for classification benchmarking, with a variety of algorithms. We considered six different train/test patterns, given by the different combinations of training on photographs, art, or a mixed set, and testing on photos or art alone. In all cases we repeated the experiment 5 times, randomly selecting 30 images for training and using the rest for testing.
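
The repeated random-split protocol can be sketched as below. The helper names are hypothetical, `evaluate` stands for any routine that trains and tests a classifier on the given splits and returns an accuracy, and the sketch assumes the 30 training images are drawn per class; how the 30 images are divided between art and photographs in the mixed case is not specified here.

```python
import numpy as np

def split_class(n_images, n_train=30, rng=None):
    """Randomly split the image indices of one class into train and test."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(n_images)
    return perm[:n_train], perm[n_train:]

def mean_accuracy(evaluate, class_sizes, trials=5, seed=0):
    """Average an evaluation function over several random splits."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(trials):
        splits = [split_class(n, rng=rng) for n in class_sizes]
        accuracies.append(evaluate(splits))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

Reporting the standard deviation alongside the mean is what allows the later statements about differences being within or beyond 2%.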

For BoW and FVs: for each descriptor we built an SVM classifier using a $\chi^2$ kernel. Domain adaptation is about moving from one domain to another, so we only benchmarked photograph to art, and vice-versa. To monitor any effect of domain adaptation we built a control classifier: a linear SVM using SIFT features (SVM). We also used principal component analysis on the source (PCA_S) and target (PCA_T) domains individually to reveal the impact of PCA. We implemented the geodesic flow kernel (GFK) [31] with PCA and LDA applied to the data, and also subspace alignment (SA) [24], as high quality domain adaptation algorithms. Deformable models are used to classify by scanning an image in an effort to detect each class - the class with the highest detection score is used as the predicted class. We follow [?], who use CNNs for classifying art; we also include photographs.

5.1 Results and Discussion for Classification

Tables 1 and 2 show our results for the classification benchmarks. Each row is a different train/test pattern, and each column an algorithm or local feature descriptor. Each cell shows the percentage of correct classifications, averaged over 5 runs, rounded to the nearest integer.

General patterns emerge. It is clear that the BoW and domain adaptation methods perform the least well. Models that take spatial relations into account perform better. What is most striking and surprising is that models using inter-part distance alone as a feature are comparable with R-CNN when photographs are used as the test set, and are the best performing of all when artwork is used as the test set (CLT excepted: classifier is important).

For most algorithms in Table 1, training on photographs and testing on photographs yields the highest performance. For all algorithms in Table 1, training on photographs and testing on art proves the most difficult case; training on art and testing on photographs is the most difficult case whenever the test set is restricted to photographs alone.

Against this data, the domain adaptation methods offer no advantage - with the exception of domain-adapted DPM. We note that Crowley and Zisserman [12] report a similar pattern of findings for their adaptation model (which includes spatial and feature adaptation).

Looking at details, the Fisher Vector is the best of the BoW-like methods, which is consistent with observations in [?]. EdgeHOG outperforms the standard HOG when trained on artwork, which is consistent with the observations of [?] [35]. Geometric Blur also captures edge information, which may explain why it drops away the least in the photo/art train/test pattern. The SSD performance is possibly surprising - but results similar to our own have been observed by others interested in the sketch-based classification task [?]; its poor performance is possibly explained by its need for relatively rigid objects.
train        test         DPM   ADPM  MG
Photo        Photo        96    -     89
Art          Photo        80    84    85
Art          Art          84    -     88
Photo        Art          73    78    71
Photo + Art  Photo & Art  84    -     89

Table 3: Detection precision: comparison of mean average precision (mAP) across photo-art domains on our dataset, with 30 images per object for training. DPM and ADPM stand for DPM trained without and with cross-depiction expansion, respectively; MG is the fully-connected multi-labelled graph.


method  fine-tuning      test      AP
DPM     -                art-test  32
R-CNN   PASCAL VOC2012   art-test  26
R-CNN   art-trainval     art-test  40

Table 4: Detection performance (average precision, AP) of the DPM model and the R-CNN model on the People-Art dataset.

6 Detection Benchmarks 



We carry out object detection on Photo-Art-50 for DPM, adapted DPM (ADPM), and the fully-connected multi-labelled graph (MG). To make a detection in an image, we first construct a dense multi-scale feature pyramid. Then we locate each node of the class model (DPM, ADPM, MG) at the best k locations. We then build a structure used in matching, the exact scoring mechanism being determined by [21] for DPM and ADPM and by [52] for MG. This is a deterministic algorithm, so it was run exactly once per image.

Detection rates on Photo-Art-50 are high, so we constructed the more challenging dataset, People-Art (Section 3). We detected people using DPM, R-CNN without domain refinement, and R-CNN with domain refinement. We see that general purpose DPM outperforms R-CNN, unless the deep learning network is refined on artwork.

6.1 Results and Discussion for Detection

Table 3 shows results using the Photo-Art-50 dataset. Each row shows a training/testing pattern, and each column an algorithm. Each cell is the mean average precision (mAP) across 50 objects, with a standard deviation of 2%. It shows that DPM performs very well when photographs are used for both training and testing, which is consistent with previous work [?]. The fully connected multi-labelled graph (MG) outperforms DPM in all other cases except when photographs form the training set and artwork is used for testing; but the standard deviation on the error is 2%, so that difference is not significant. Echoing the classification task in Section 5, the performance of both DPM and MG drops significantly in this case compared to other train/test patterns. The Adapted DPM shows the best performance on the heterogeneous train/test pattern, and does so beyond the 2% deviation limit. This makes it somewhat significant.

Figure 3: Above: each image in Photo-Art-50 plotted in an eigenspace spanning raw images, art in red, photos in blue. Below: the centre of each class in Photo-Art-50: red (art), blue (photo). The images and the cluster centres tend to form two groups: art/photo.

Results for detection on the People-Art dataset are shown in Table 4. These show that DPM outperforms the R-CNN trained on photographs alone, but once R-CNN is tuned to the People-Art dataset it outperforms DPM.

7 General Discussion 

Across all classification and detection experiments we
see the same trend: a fall in performance in any case where
art is included. This fall is most marked whenever photographs
are used for training and artwork for testing,
and is seen in all cases other than the M-Graph [55].

These observations need an explanation. Intuition
suggests that a difference between the low-level image
statistics of photographs and artwork is a
cause. To investigate this we used all of the 10356 raw
images in Photo-Art-50 and rescaled them to square images
with 256 pixels per edge. We then represented
each image as a vector, and computed the covariance
over all art images and over all photographs; the largest single
eigenvalue for photographs is $2.65 \times 10^{6}$, and the largest
single eigenvalue for artwork is $3.42 \times 10^{3}$. This shows
that the variance over artworks is about 1000 times
greater than the variance over photographs.
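The eigenvalue comparison can be sketched as follows. This is a minimal illustration on synthetic arrays, assuming images are already rescaled to a common size; the function name and the toy data are hypothetical, and the singular value decomposition is used only to avoid forming the full pixel-by-pixel covariance matrix.

```python
import numpy as np

def largest_covariance_eigenvalue(images):
    """Largest eigenvalue of the sample covariance of flattened images.

    images -- array of shape (n_images, ...); each image becomes one row vector.
    """
    x = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    x = x - x.mean(axis=0, keepdims=True)
    # For centred data, covariance eigenvalues equal s**2 / (n - 1),
    # where s are the singular values of the data matrix.
    s = np.linalg.svd(x, compute_uv=False)
    return float(s[0] ** 2 / (x.shape[0] - 1))

# Synthetic stand-ins for the two image sets (NOT the Photo-Art-50 data).
rng = np.random.default_rng(0)
toy_photos = rng.normal(0.0, 1.0, size=(40, 16, 16))
toy_art = rng.normal(0.0, 3.0, size=(40, 16, 16))
variance_ratio = (largest_covariance_eigenvalue(toy_art)
                  / largest_covariance_eigenvalue(toy_photos))
```

With real data, `toy_photos` and `toy_art` would be replaced by the stacked, rescaled photographs and artworks respectively.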
Results from a more detailed version of this experiment
can be seen in Table [5]. There, the symmetric
KL-divergence is computed for different data sets that
each comprise two domains. The domains C (Caltech-256),
A (Amazon), W (Webcam), and D (DSLR) are
all used in domain adaptation problems. The symmetric
KL-divergence between these is shown alongside
the difference between art and photographs in
Photo-Art-50. As can be seen, the distance between domains
in Photo-Art-50 is by far the largest. This may explain
the difficulty domain adaptation methods appear
to face when ‘jumping the gap’ in the cross-depiction
problem: the gap is wider than in the datasets usually
used in domain adaptation.
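A symmetric KL-divergence between two domains can be sketched as below. The text does not state how the domain distributions were estimated, so this example assumes simple histograms over a one-dimensional feature and defines the symmetric form as the sum of the two directed divergences; both choices, and all names, are assumptions made for illustration.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two histograms."""
    p = np.asarray(p, dtype=np.float64) + eps  # eps avoids log(0) on empty bins
    q = np.asarray(q, dtype=np.float64) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def domain_histogram(values, bins=32, value_range=(-6.0, 8.0)):
    """Histogram of a 1-D feature; shared bins make domains comparable."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    return counts.astype(np.float64)

# Synthetic stand-ins for three domains (hypothetical data).
rng = np.random.default_rng(0)
base = domain_histogram(rng.normal(0.0, 1.0, 5000))
near = domain_histogram(rng.normal(0.2, 1.0, 5000))
far = domain_histogram(rng.normal(2.0, 1.0, 5000))
kl_near = symmetric_kl(base, near)
kl_far = symmetric_kl(base, far)
```

In this toy setting the domain whose feature distribution is shifted further from the base yields the larger divergence, which is the kind of comparison made between the domain pairs in Table 5.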

A stronger hypothesis is this. Let $X$ be an object
class, let $x_P \in X$ be a photographic instance, and let
$x_A \in X$ be an artwork instance of that class. Similarly
$y_P, y_A \in Y$ are a photograph and an artwork of class $Y$.
Denote the set of all $x_P$ by $X_P$, meaning the ‘photo object’, etc.
Suppose too there is a measure $d(\cdot, \cdot)$. We expect the
intra-domain distance (same domain, different class) to
be less than the inter-domain distance (different domain,
same class). That is, $d(x_P, x_A) > d(x_P, y_P)$,
$d(x_P, x_A) > d(x_A, y_A)$, etc., which we call ‘atomic
hypothesis statements’. To test this we used the raw images
of Photo-Art-50 as input, each scaled to a square
image of pixel width 256. We then mapped all the data
into a 4-dimensional space using PCA over all the data,
and computed an eigenmodel for each ‘photo object’
$X_P$ and each ‘art object’ $X_A$. We assumed a K-NN
classifier, so that $X_P$ is represented by its mean, likewise
$X_A$, and the measure is Euclidean distance. We
found a fraction 0.67 of all possible atomic hypothesis
statements to be true. This means the centres
of objects in different classes are expected to be
closer than the centres of the same object in different
domains, which may confuse feature-based classifiers
and explain our results above.
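The test of the atomic hypothesis statements can be sketched as follows. This is an illustrative reading, not the code used for the reported 0.67: it assumes each ‘photo object’ and ‘art object’ is reduced to its mean in a 4-dimensional PCA space, and that one pair of statements is counted for every ordered pair of distinct classes. The function names and the toy data generator are hypothetical.

```python
import numpy as np

def pca_project(features, n_components=4):
    """Project row vectors onto the leading principal components."""
    centred = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def atomic_hypothesis_fraction(features, class_ids, is_art, n_components=4):
    """Fraction of statements d(X_P, X_A) > d(X_P, Y_P) and
    d(X_P, X_A) > d(X_A, Y_A) that hold, over all classes X != Y."""
    z = pca_project(np.asarray(features, dtype=np.float64), n_components)
    class_ids = np.asarray(class_ids)
    is_art = np.asarray(is_art, dtype=bool)
    classes = np.unique(class_ids)
    photo = {c: z[(class_ids == c) & ~is_art].mean(axis=0) for c in classes}
    art = {c: z[(class_ids == c) & is_art].mean(axis=0) for c in classes}
    n_true, n_total = 0, 0
    for x in classes:
        cross_domain = np.linalg.norm(photo[x] - art[x])
        for y in classes:
            if x == y:
                continue
            n_true += int(cross_domain > np.linalg.norm(photo[x] - photo[y]))
            n_true += int(cross_domain > np.linalg.norm(art[x] - art[y]))
            n_total += 2
    return n_true / n_total

def make_toy_data(domain_shift, class_spread, n_classes=3,
                  n_per_group=20, dim=10, seed=0):
    """Synthetic features: axis 0 encodes depiction, axis 1 encodes class."""
    rng = np.random.default_rng(seed)
    feats, labels, art_flags = [], [], []
    for c in range(n_classes):
        for art_flag in (False, True):
            centre = np.zeros(dim)
            centre[0] = domain_shift * art_flag
            centre[1] = class_spread * c
            feats.append(centre + 0.05 * rng.normal(size=(n_per_group, dim)))
            labels += [c] * n_per_group
            art_flags += [art_flag] * n_per_group
    return np.vstack(feats), np.array(labels), np.array(art_flags)

depiction_dominates = atomic_hypothesis_fraction(*make_toy_data(10.0, 1.0))
identity_dominates = atomic_hypothesis_fraction(*make_toy_data(0.5, 5.0))
```

The two toy settings bracket the reported value: when the depiction shift dominates every statement holds, and when class identity dominates none do.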

These simple experiments help explain our benchmark
results, and results reported by others. They
show that the distance between artwork and photographs
for any given class of object is expected to be
larger than the distance between photographs of different classes.
That is, the variation due to depiction is greater than
the variation due to object identity. This explains why
training on photographs alone and testing
on art gives the poorest results: the photographs cluster
into a relatively small region of feature space, and
algorithms seem to over-fit the data in that small region.
Artwork, on the other hand, tends to be more
varied than photographs, which is why training on art
and testing on photographs tends to be a little more
robust. Training on a mixed set gives clearly the
best results, because the training data span the full
variance.

This wide variance in low-level statistics also helps
explain the appeal of spatial information regarding object
class identity. So far every method we have
experimented with that uses some kind of spatial information
shows less fall away in the cross-depiction problem; this
is true also of m-. In this paper we see DPM outperform
BoW, and the M-Graph outperform DPM. This
result is in line with (e.g.) Leordeanu et al. m who
use the distance between low-level parts (edgelets) as
a feature to characterise objects and achieve excellent
detection results on the PASCAL dataset m of photographs;
it may be effective too on Photo-Art-50, but
this is yet to be proven.

Figure 4: The presence of a face depends on spatial
arrangement of parts: above, no face; below, smiling
face.

                  Cross-domain datasets [50] [51]        Photo-Art-50
Domain pair       C-A     C-D     A-W     D-A     D-W    Photo-Art
K-L divergence    0.079   0.271   0.239   0.292   0.047  0.466

Table 5: Comparison of symmetric K-L divergence
$D(P_1, P_2)$ between domain pairs. Four domain sets
in [50] [51]: C - Caltech-256, A - Amazon, W - Webcam,
D - DSLR.

This empirical data is supported anecdotally. The
children’s drawings in Figure |T| are clearly people, but
have little in common with photographs of people, and
not much in common with one another. Consider too
Figure [4], in which the same parts form a face, or not,
depending only on the spatial arrangement of the parts.
Indeed, for artwork from prehistory to the present day,
whether produced by a professional or a child, no matter
where in the world, the great majority of it relies
on spatial organisation for recognition.

8 Applications 

We have already stated that a solution to the cross-depiction
problem should support advanced applications
such as web search. It will also support applications
such as advanced image editing, examples of which we
provide in this section.

Structure, spatial layout, and shape are all important
characteristics in identifying objects. These same
characteristics can also be used to generate artwork
directly from photographs. Consider Figure [5], which shows
a photograph of a bird feeding its young. The photograph
has been segmented, and the segments classified
into one of a few qualitative shapes (square, circle,
triangle, ...). In the most extreme case just one class
(circle) is used. See [54] for details of the computer
graphics algorithm.

Figure 5: Shape abstraction for Automated Art.

It is true that as the degree of abstraction grows the
original interpretation of the image becomes harder to
maintain; but given too the degree of abstraction in
children’s drawings, the conclusion is that both the quality
and quantity of abstraction are important for recognition.
In this case the aim was only to produce a
“pretty” image that bears some resemblance to the original.
However, simple qualitative shapes of the kind
used here can be learned directly from segmentations,
and are sufficient to classify scene type (indoor, outdoor,
city ...) at close to state-of-the-art rates 103-.

Shape is not the only form of abstraction useful to
the production of art; structure can be used too.
Figure [6] shows examples of computer generated art based
on rendering structure. In this case the arcs of a graph
have been visualised in a non-photorealistic manner,
and the shapes of parts at nodes have been classified
into qualitative shapes; see [32] for details. An almost
identical representation has been used for object
class recognition [6T]. Even though its rates are lower than
those reported above (around the mid 60% mark), the
representation does not exhibit the “fall off” when trained
on one domain and tested on another, as we have seen
with all but one of the methods we have tested in this
paper.

9 Conclusion 

The cross-depiction problem poses an important open
problem for Computer Vision. Seeing, as understood
outside the field, usually implies parsing a visual signal
into semantic objects (I can see it); in particular
it makes no distinction between how those objects are
depicted. Our results show that recognition algorithms
premised directly on appearance suffer a fall in performance
within the cross-depiction problem; probably
because they tacitly assume limited variance of low-level
statistics.

All the results we have suggest that spatial organisation
between parts is significant with regard to object
recognition. For example, DPM out-performs HOG-BoW,
even though both use the same low level features;
the M-Graph, with a stronger spatial model,
out-performs DPM. This is because, possibly, structure
and spatial layout capture the essential form of an
object class, with specific appearance relegated to the
level of detail. In other words, structure and space are
more salient to robust identification than appearance.
Indeed all algorithms we have tested show a significant
fall compared to their own peak in performance when
trained on photographs and tested on art; this includes
the deep learning methods we have used. The single
exception is the M-Graph [55], which explicitly models a strong
structure, and explains appearance details using multiple
labels on each node (multiple labels to account
for both art and photographic appearance).

Deep learning performs very well on classification
over Photo-Art-50, but when presented with the problem
of people detection it suffers a significant drop
in performance. These equivocal results make it difficult
to conclude that deep learning is a solution to
the cross-depiction problem; more exactly, the deep
learning methods we have tested do not solve cross-depiction.
An alternative network may perform better.

The cross-depiction problem pushes at the foundations
of computer vision, and by doing so it enables new
applications; potentially in search, certainly in rendering.
Given the fact that the same kind of representations
are used both for abstract rendering and for
recognition, the conclusion that there is a strong relation
between the two is hard to escape, and is made harder
still by the observation that people draw what they
know, not what they see. (When draughting in Art was
considered important, students in art school had to be
trained to draw what they see rather than what they
know.)

Figure 6: Structure and Shape combine to make art in the style of (left to right) petroglyphs, child art, Joan
Miro.

In summary: the cross-depiction problem pushes the
envelope of computer vision research. It offers significant
challenges, which if solved will support a range of
applications. Modelling visual classes using structure
and spatial relations seems to offer a useful way forward;
the role of deep learning in the problem is yet to
be fully proven.